Abstract. This paper provides theoretical support for the clustering aspect of the nonnegative
matrix factorization (NMF). By utilizing the Karush-Kuhn-Tucker optimality conditions, we show
that the NMF objective is equivalent to a graph clustering objective, so the clustering aspect of
the NMF has a solid justification. Unlike previous approaches, which usually discard the
nonnegativity constraints, our approach guarantees that the stationary point used in deriving the
equivalence is located in the feasible region in the nonnegative orthant. Additionally, since the
clustering capability of a matrix decomposition technique can sometimes imply its latent semantic
indexing (LSI) aspect, we also evaluate the LSI aspect of the NMF by showing its capability in
solving the synonymy and polysemy problems on synthetic datasets. A more extensive evaluation is
conducted by comparing the LSI performance of the NMF with that of the singular value
decomposition (SVD), the standard LSI method, on some standard datasets.
1. Introduction. Nonnegative datasets are everywhere: from the term-by-document
matrix induced from a document corpus [U [5], gene expression datasets [3], pixels
in digital images [1], and disease patterns [S], to spectral signatures from astronomical
spectrometers [S], among others. Even though diverse, they have one thing in common:
all can be represented by nonnegative matrices induced from the datasets. This
allows many well-established mathematical techniques to be applied in order to analyze
the datasets.

There are many common tasks associated with these datasets, for example: grouping
similar data points (clustering), finding patterns in the datasets, identifying
important or interesting features, and finding sets of data points relevant to queries
(information retrieval). In this paper, we focus on two tasks: clustering and latent
semantic indexing, a technique that can be used for improving the recall and precision
of an information retrieval (IR) system.

1.1. Clustering. Clustering is the task of assigning data points to clusters
such that similar points are in the same cluster and dissimilar points are in
different clusters. There are many types of clustering, for example
supervised/unsupervised, hierarchical/partitional, hard/soft, and one-way/many-way
(two-way clustering is known as co-clustering or bi-clustering), among others. In this
paper, the term clustering refers to unsupervised, partitional, hard, and one-way
clustering. Further, the number of clusters is given beforehand.

The NMF as a clustering method can be traced back to the work of Lee & Seung
[4]. However, the first work that explicitly demonstrates it is that of Xu et al. [1], in
which they show that the NMF outperforms the spectral methods in terms of purity
and mutual information measures on the Reuters and TDT2 datasets.

The clustering aspect of the NMF, even though numerically well studied, is not
theoretically well explained. Usually this aspect is explained by showing the equivalence
between the NMF objective and either the k-means clustering objective [7], [8] or the
spectral clustering objective [7]. The problem with the first approach is that there is no
obvious way to incorporate the nonnegativity constraints into the k-means clustering
objective. The problem with the second approach is that it discards the nonnegativity
constraints, and is thus equivalent to finding stationary points on an unbounded region.
Accordingly, the NMF, which is a bound-constrained optimization problem, turns into an
unbounded optimization problem, so there is no guarantee that the stationary point
utilized in proving the equivalence is located in the feasible region indicated by the
constraints.

In the first part of this paper, we provide theoretical support for the clustering
aspect of the NMF by analyzing the objective at the stationary point using the
Karush-Kuhn-Tucker (KKT) conditions without setting the KKT multipliers to zero.
Thus, the stationary point under investigation is guaranteed to be located in the
feasible region.

1.2. Latent semantic indexing. Latent semantic indexing (LSI) is a method
introduced by Deerwester et al. [9] to improve the recall and precision of an IR system.
It uses the truncated singular value decomposition (SVD) of the term-by-document matrix
to reveal hidden relationships between documents, by indexing terms that are present
in similar documents and weakening the influence of terms that are mutually
present in dissimilar documents. The first capability can solve the synonymy problem
(different words with similar meanings), and the second capability can solve the
polysemy problem (words with multiple unrelated meanings). Thus, LSI is not only able
to retrieve relevant documents that do not contain the terms in the query, but can also
filter out irrelevant documents that contain terms in the query.

The LSI aspect of the NMF is not well studied. There are some works that discuss the
relationship between the NMF and probabilistic LSI, e.g., [10], [11]. But the emphasis
is on the clustering capability of probabilistic LSI, not the LSI aspect of the NMF.
Motivated by the SVD, which is the standard method in clustering and LSI, in the second
part of this paper the LSI aspect of the NMF is studied, and the results are compared
to the results of the SVD.

2. The nonnegative matrix factorization. The NMF was popularized by
the work of Lee & Seung [4], in which they showed that this technique can be used
to learn parts of faces and semantic features of text. Previously, it had been studied
under the term positive matrix factorization [12], [13]. Mathematically, the NMF is a
technique that decomposes a nonnegative data matrix into a pair of other nonnegative
matrices:

\[ \mathbf{A} \approx \mathbf{B}\mathbf{C}, \tag{2.1} \]

where $\mathbf{A} \in \mathbb{R}_+^{M \times N} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$ denotes the data matrix,
$\mathbf{B} \in \mathbb{R}_+^{M \times K} = [\mathbf{b}_1, \ldots, \mathbf{b}_K]$ denotes the basis matrix,
$\mathbf{C} \in \mathbb{R}_+^{K \times N} = [\mathbf{c}_1, \ldots, \mathbf{c}_N]$ denotes the coefficient matrix,
and $K$ denotes the number of factors, which is usually chosen so that $K \ll \min(M, N)$.
Note that the definitions of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are chosen to simplify the
interpretation of the NMF.

To compute $\mathbf{B}$ and $\mathbf{C}$, eq. 2.1 is usually rewritten as a minimization problem in
the Frobenius norm:

\[ \min_{\mathbf{B},\mathbf{C}} J(\mathbf{B},\mathbf{C}) = \frac{1}{2}\|\mathbf{A} - \mathbf{B}\mathbf{C}\|_F^2 \quad \text{s.t.} \quad \mathbf{B} \ge 0,\ \mathbf{C} \ge 0. \tag{2.2} \]

In addition to the usual Frobenius norm, the family of Bregman divergences (of which the
Frobenius norm and the Kullback-Leibler divergence are members) can also be used
as distance measures. A detailed discussion of Bregman divergences can be found
in, e.g., ref. [14]. In this work, we consider only the Frobenius norm.
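
As a concrete illustration of the Frobenius objective in eq. 2.2, the following minimal Python/NumPy sketch evaluates $J(\mathbf{B},\mathbf{C})$ on a small example. The function name `nmf_objective` and the toy matrices are assumptions made up for illustration; they do not come from the experiments in this paper.

```python
import numpy as np

def nmf_objective(A, B, C):
    """Return J(B, C) = 0.5 * ||A - B C||_F^2 (the objective of eq. 2.2)."""
    R = A - B @ C
    return 0.5 * np.sum(R * R)

# Hypothetical toy data: a 4x3 nonnegative matrix that factors exactly with K = 2.
B_true = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
C_true = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
A = B_true @ C_true

exact = nmf_objective(A, B_true, C_true)            # exact factorization
perturbed = nmf_objective(A, B_true + 0.1, C_true)  # still feasible, but a worse fit
```

The exact factors give a zero objective, while any other feasible pair, such as the perturbed one, gives a strictly positive value.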

3. Limit points of the sequences generated by NMF algorithms. All
NMF algorithms are formulated in an alternating fashion, fixing one matrix while
solving for the other (the popular Lee & Seung algorithms [15] and their derivatives, e.g.,
[3 O [ini E], also use the alternating strategy, but cannot be represented by the generic
algorithm below). This strategy is employed because the NMF problem is nonconvex with
respect to $\mathbf{B}$ and $\mathbf{C}$ jointly, but convex with respect to $\mathbf{B}$ or $\mathbf{C}$
separately [18]. Thus, the alternating strategy transforms the NMF problem into a pair of
convex subproblems. Transforming a nonconvex problem into the corresponding convex
subproblems is a common practice in optimization research because: (1) convex optimization
is more tractable, (2) convex methods are usually more efficient, (3) any local optimum is
necessarily a global optimum, and (4) the algorithms are easy to initialize [19].

Algorithm 1 solves the NMF problem in an alternating fashion, which generates
a solution sequence $\{\mathbf{B}^{(k)}, \mathbf{C}^{(k)}\}_{k=0}^{\infty}$. This algorithm is known as the
alternating nonnegativity-constrained least squares (ANLS) algorithm, and it is usually
solved by decomposing each subproblem into the corresponding nonnegativity-constrained
least squares (NNLS) problems, for which there are many algorithms that guarantee global
optimality. The following equations are the NNLS versions of the ANLS in algorithm 1:

\[ \check{\mathbf{b}}_m^{(k+1)T} = \arg\min_{\check{\mathbf{b}}_m \ge 0} \frac{1}{2}\|\check{\mathbf{a}}_m^T - \mathbf{C}^{(k)T}\check{\mathbf{b}}_m^T\|_F^2, \quad \forall m \tag{3.3} \]
\[ \mathbf{c}_n^{(k+1)} = \arg\min_{\mathbf{c}_n \ge 0} \frac{1}{2}\|\mathbf{a}_n - \mathbf{B}^{(k+1)}\mathbf{c}_n\|_F^2, \quad \forall n, \tag{3.4} \]

where $\check{\mathbf{x}}_i$ denotes the $i$-th row of $\mathbf{X}$.

Algorithm 1 Generic algorithm for the NMF based on ANLS.
  Initialization: $\mathbf{C}^{(0)} \ge 0$.
  for $k = 0, 1, \ldots$ do
    \[ \mathbf{B}^{(k+1)} = \arg\min_{\mathbf{B} \ge 0} \frac{1}{2}\|\mathbf{A} - \mathbf{B}\mathbf{C}^{(k)}\|_F^2 \tag{3.1} \]
    \[ \mathbf{C}^{(k+1)} = \arg\min_{\mathbf{C} \ge 0} \frac{1}{2}\|\mathbf{A} - \mathbf{B}^{(k+1)}\mathbf{C}\|_F^2 \tag{3.2} \]
  end for

According to Grippo & Sciandrone [20], any limit point of $\{\mathbf{B}^{(k)}, \mathbf{C}^{(k)}\}_{k=0}^{\infty}$
generated by any ANLS algorithm that optimally solves the convex subproblems eq. 3.1
and eq. 3.2 is a stationary point. Such ANLS-based NMF algorithms exist,
e.g., [El [HI EH [221 [231 [21], so the stationary points are guaranteed to be
reachable. As NNLS is the building block of ANLS, any NNLS algorithm
that is guaranteed to find optimal solutions of eq. 3.3 and eq. 3.4, e.g., [25], [26], [27], can
also be employed to search for the stationary points. As will be shown in section 4,
the NMF objective (eq. 2.2) implicitly puts upper bounds on the feasible region (the
lower bounds are explicit: the nonnegativity constraints). Thus the NMF is a
bound-constrained optimization problem, and consequently $\{\mathbf{B}^{(k)}, \mathbf{C}^{(k)}\}_{k=0}^{\infty}$ has at
least one limit point [TS]. This completes the conditions for any NMF algorithm that
optimally solves subproblems eq. 3.1 and eq. 3.2 to have a convergence guarantee.
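
The alternating structure of algorithm 1 can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch under stated assumptions: instead of an exact NNLS solver, each convex subproblem is solved approximately by projected gradient descent with step size $1/L$ (a swapped-in inner solver), so the convergence guarantee discussed above is not claimed for this code. The names `solve_nonneg_ls` and `anls`, the iteration counts, and the random test matrix are all hypothetical.

```python
import numpy as np

def objective(A, B, C):
    R = A - B @ C
    return 0.5 * np.sum(R * R)

def solve_nonneg_ls(A, B, C, inner_iters=200):
    """Approximately solve min_{C >= 0} 0.5 * ||A - B C||_F^2 for fixed B.

    Projected gradient descent with step 1/L, warm-started at C. This is a
    stand-in for an exact NNLS solver, not an exact method itself.
    """
    G = B.T @ B
    L = np.linalg.norm(G, 2)  # largest eigenvalue of B^T B (gradient Lipschitz constant)
    if L == 0.0:
        return C
    BtA = B.T @ A
    for _ in range(inner_iters):
        # gradient step followed by projection onto the nonnegative orthant
        C = np.maximum(C - (G @ C - BtA) / L, 0.0)
    return C

def anls(A, K, outer_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    M, N = A.shape
    B = rng.random((M, K))
    C = rng.random((K, N))
    history = [objective(A, B, C)]
    for _ in range(outer_iters):
        # eq. 3.1: update B with C fixed, via the transposed problem A^T ~ C^T B^T
        B = solve_nonneg_ls(A.T, C.T, B.T).T
        # eq. 3.2: update C with the new B fixed
        C = solve_nonneg_ls(A, B, C)
        history.append(objective(A, B, C))
    return B, C, history
```

Because every inner step is a projected gradient step with step size $1/L$ on a convex quadratic and is warm-started at the current iterate, the recorded objective values in `history` are non-increasing up to rounding.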

4. Clustering aspect of the NMF. This section is the first part of this paper,
in which a theoretical framework supporting the clustering aspect of the NMF is
provided. The strict KKT optimality conditions are utilized to derive the
equivalence between the NMF objective and a graph clustering objective. Unlike previous
approaches where the KKT multipliers are set to zero (T] [211 [211 [30], we make no
assumption about the KKT multipliers, thus the stationary point under investigation
is guaranteed to be located in the feasible region in the nonnegative orthant.

We will also show that the feasible region is bounded: the lower bounds
are set explicitly by the nonnegativity constraints, and the upper bounds are
set implicitly by the objective. As stated in section 3, the boundedness of
the feasible region is the necessary condition for guaranteeing the existence of a limit
point of $\{\mathbf{B}^{(k)}, \mathbf{C}^{(k)}\}_{k=0}^{\infty}$. For interpretability, the data matrix $\mathbf{A}$ will be
considered as a feature-by-item data matrix unless stated otherwise.

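The following proposition gives the theoretical support for the clustering aspect of
the NMF.

Proposition 4.1. Minimizing the following objective

\[ \min_{\mathbf{B},\mathbf{C}} J(\mathbf{B},\mathbf{C}) = \frac{1}{2}\|\mathbf{A} - \mathbf{B}\mathbf{C}\|_F^2 \quad \text{s.t.} \quad \mathbf{B} \ge 0,\ \mathbf{C} \ge 0, \tag{4.1} \]

leads to the feature clustering indicator matrix $\mathbf{B}$ and the item clustering indicator
matrix $\mathbf{C}$.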

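Proof.

\[ \|\mathbf{A} - \mathbf{B}\mathbf{C}\|_F^2 = \mathrm{tr}\left(\mathbf{A}^T\mathbf{A} - 2\mathbf{C}\mathbf{A}^T\mathbf{B} + \mathbf{B}^T\mathbf{B}\mathbf{C}\mathbf{C}^T\right). \]

Since $\mathbf{A}$ is constant, minimizing $J$ is equivalent to simultaneously optimizing:

\[ \max_{\mathbf{B},\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{A}^T\mathbf{B}\right) \tag{4.2} \]
\[ \min_{\mathbf{B},\mathbf{C}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{B}\mathbf{C}\mathbf{C}^T\right). \tag{4.3} \]

Note that because $\mathrm{tr}(\mathbf{X}\mathbf{Y}) \le \mathrm{tr}(\mathbf{X})\,\mathrm{tr}(\mathbf{Y})$ for positive semidefinite
$\mathbf{X}$ and $\mathbf{Y}$ (here $\mathbf{B}^T\mathbf{B}$ and $\mathbf{C}\mathbf{C}^T$), minimizing eq. 4.3 is equivalent to

\[ \min_{\mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{B}\right) \quad \text{and} \tag{4.4} \]
\[ \min_{\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{C}^T\right). \tag{4.5} \]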


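The KKT function of the objective in eq. 4.1 is:

\[ L(\mathbf{B},\mathbf{C}) = J(\mathbf{B},\mathbf{C}) - \mathrm{tr}\left(\mathbf{\Gamma}_B\mathbf{B}^T\right) - \mathrm{tr}\left(\mathbf{\Gamma}_C\mathbf{C}\right), \]

where $\mathbf{\Gamma}_B \in \mathbb{R}^{M \times K}$ and $\mathbf{\Gamma}_C \in \mathbb{R}^{N \times K}$ are the KKT multipliers. By
applying the KKT optimality conditions to $L$ we get:

\[ \nabla_{\mathbf{B}} L = \mathbf{B}\mathbf{C}\mathbf{C}^T - \mathbf{A}\mathbf{C}^T - \mathbf{\Gamma}_B = 0 \tag{4.6} \]
\[ \nabla_{\mathbf{C}} L = \mathbf{B}^T\mathbf{B}\mathbf{C} - \mathbf{B}^T\mathbf{A} - \mathbf{\Gamma}_C^T = 0, \tag{4.7} \]

with complementary slackness:

\[ \mathbf{\Gamma}_B \odot \mathbf{B} = 0, \quad \text{and} \quad \mathbf{\Gamma}_C^T \odot \mathbf{C} = 0, \]

where $\odot$ denotes component-wise multiplication. Eq. 4.6 and eq. 4.7 lead to:

\[ \mathbf{B} = \left(\mathbf{A}\mathbf{C}^T + \mathbf{\Gamma}_B\right)\left(\mathbf{C}\mathbf{C}^T\right)^{-1} \tag{4.8} \]
\[ \mathbf{C} = \left(\mathbf{B}^T\mathbf{B}\right)^{-1}\left(\mathbf{B}^T\mathbf{A} + \mathbf{\Gamma}_C^T\right). \tag{4.9} \]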
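
The trace expansion at the start of the proof and the gradient parts of eq. 4.6 and eq. 4.7 can be checked numerically. The following Python/NumPy sketch is only a sanity check on random matrices (all sizes and names are hypothetical): it compares the two sides of the trace identity and compares the analytic gradients of $J$, that is, the gradients of $L$ without the multiplier terms, against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 5, 4, 2            # hypothetical small sizes
A = rng.random((M, N))
B = rng.random((M, K))
C = rng.random((K, N))

def J(B, C):
    R = A - B @ C
    return 0.5 * np.sum(R * R)

# ||A - BC||_F^2 = tr(A^T A - 2 C A^T B + B^T B C C^T)
lhs = np.sum((A - B @ C) ** 2)
rhs = np.trace(A.T @ A) - 2.0 * np.trace(C @ A.T @ B) + np.trace(B.T @ B @ C @ C.T)

# Analytic gradients of J; eq. 4.6 and eq. 4.7 subtract the multipliers from these.
grad_B = B @ C @ C.T - A @ C.T
grad_C = B.T @ B @ C - B.T @ A

def numeric_grad(f, X, h=1e-6):
    """Central finite-difference gradient of the scalar function f at X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2.0 * h)
    return G

num_B = numeric_grad(lambda X: J(X, C), B)
num_C = numeric_grad(lambda X: J(B, X), C)
```

Both comparisons agree to within floating-point error, which is consistent with the expressions used in the derivation.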

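Substituting eq. 4.9 into eq. 4.2 leads to:

\[ \max_{\mathbf{B}} \mathrm{tr}\left(\left(\mathbf{B}^T\mathbf{B}\right)^{-1}\left(\mathbf{B}^T\mathbf{A}\mathbf{A}^T\mathbf{B} + \mathbf{\Gamma}_C^T\mathbf{A}^T\mathbf{B}\right)\right), \]

which is equivalent to simultaneously optimizing:

\[ \max_{\mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{A}\mathbf{A}^T\mathbf{B}\right) \tag{4.10} \]
\[ \max_{\mathbf{B}} \mathrm{tr}\left(\mathbf{\Gamma}_C^T\mathbf{A}^T\mathbf{B}\right) \tag{4.11} \]
\[ \min_{\mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{B}\right). \tag{4.12} \]

Similarly, substituting eq. 4.8 into eq. 4.2 leads to:

\[ \max_{\mathbf{C}} \mathrm{tr}\left(\left(\mathbf{C}\mathbf{A}^T\mathbf{A}\mathbf{C}^T + \mathbf{C}\mathbf{A}^T\mathbf{\Gamma}_B\right)\left(\mathbf{C}\mathbf{C}^T\right)^{-1}\right), \]

which is equivalent to simultaneously optimizing:

\[ \max_{\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{A}^T\mathbf{A}\mathbf{C}^T\right) \tag{4.13} \]
\[ \max_{\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{A}^T\mathbf{\Gamma}_B\right) \tag{4.14} \]
\[ \min_{\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{C}^T\right). \tag{4.15} \]

As shown, eq. 4.12 and eq. 4.15 recover eq. 4.4 and eq. 4.5 respectively, so there is no
need to substitute eq. 4.8 and eq. 4.9 into eq. 4.3.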

Now we concentrate on the basis matrix $\mathbf{B}$ first. Eq. 4.10 - 4.12 give alternative
objectives to the original NMF objective that contain only $\mathbf{B}$. Note that if we consider
$\mathbf{A}$ to be an affinity matrix induced from a bipartite graph $\mathcal{G}(\mathbf{A})$ (which is
reasonable, since any feature-by-item matrix can be modeled by a bipartite graph), then
$\mathcal{G}(\mathbf{A}\mathbf{A}^T)$ is the feature graph, where edge weights describe the similarity between
corresponding vertex pairs. So, eq. 4.10 looks like ratio association applied to
$\mathcal{G}(\mathbf{A}\mathbf{A}^T)$. But without the orthogonality constraint $\mathbf{B}^T\mathbf{B} = \mathbf{I}$ (which is part
of the ratio association objective), one can optimize eq. 4.10 by letting the entries of
$\mathbf{B}$ grow without bound. However, this violates eq. 4.12, which favours small $\mathbf{B}$.
Similarly, one can optimize eq. 4.12 by setting $\mathbf{B}$ to a zero matrix. But again, this
violates eq. 4.10. Thus, eq. 4.10 and eq. 4.12 create implicit lower and upper bound
constraints on $\mathbf{B}$: $0 \le \mathbf{B} \le \mathbf{\Upsilon}_B$.

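For convenience, eq. 4.12 can be restated as:

\[ \min_{\mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{B}\right) = \min_{\mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{B}\mathbf{B}^T\mathbf{B}\right). \tag{4.16} \]

By using the fact $\mathrm{tr}(\mathbf{X}^T\mathbf{X}) = \|\mathbf{X}\|_F^2$, eq. 4.16 can be rewritten into:

\[ \min_{\mathbf{B}} \left( \|\mathbf{B}^T\mathbf{B}\|_F^2 = \sum_{i \ne j}\left(\mathbf{b}_i^T\mathbf{b}_j\right)^2 + \sum_{i}\left(\mathbf{b}_i^T\mathbf{b}_i\right)^2 \right). \]

Therefore, eq. 4.10 - 4.12 can be restated as:

\[ \max_{\mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\mathbf{A}\mathbf{A}^T\mathbf{B}\right) \tag{4.17} \]
\[ \max_{\mathbf{B}} \mathrm{tr}\left(\mathbf{\Gamma}_C^T\mathbf{A}^T\mathbf{B}\right) \tag{4.18} \]
\[ \min_{\mathbf{B}} \Big( \underbrace{\sum_{i}\left(\mathbf{b}_i^T\mathbf{b}_i\right)^2}_{j_{b1}} + \underbrace{\sum_{i \ne j}\left(\mathbf{b}_i^T\mathbf{b}_j\right)^2}_{j_{b2}} \Big) \tag{4.19} \]
\[ \text{s.t.} \quad 0 \le \mathbf{B} \le \mathbf{\Upsilon}_B. \]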

Even though $\mathbf{B}$ is now bounded, since there is no column-orthogonality constraint,
maximizing eq. 4.17 can easily be done by setting each entry of $\mathbf{B}$ to the
corresponding largest possible value (in graph terms this means creating only one
partition on $\mathcal{G}(\mathbf{A}\mathbf{A}^T)$). But this scenario results in a large value of eq. 4.19,
which violates the objective. Similarly, minimizing eq. 4.19 to the smallest possible value
violates eq. 4.17. Since minimizing $j_{b1}$ implies minimizing $j_{b2}$, but not vice versa,
simultaneously optimizing eq. 4.17 and eq. 4.19 can be done by setting $j_{b2}$ as small as
possible and balancing $j_{b1}$ with eq. 4.17. This scenario is the relaxed ratio association
applied to $\mathcal{G}(\mathbf{A}\mathbf{A}^T)$, and as long as the vertices in $\mathcal{G}(\mathbf{A}\mathbf{A}^T)$ are clustered,
it leads to the grouping of related features.

The remaining problem is eq. 4.18. Since we know nothing about $\mathbf{\Gamma}_C$, the best
bet is to make each entry of $\mathbf{A}^T\mathbf{B}$ as large as possible. This can be done by
setting $\mathbf{B}$ to the largest possible values, but this scenario violates eq. 4.19. So, the
most reasonable scenario is to make the entries near the diagonal region of $\mathbf{A}^T\mathbf{B}$ as
large as possible. This can be achieved by using $\mathbf{B}$ from the previous discussion. As
$\mathbf{B}$ is the feature clustering indicator matrix, multiplying $\mathbf{A}^T$ with $\mathbf{B}$ results in a
matrix that has larger entries near the diagonal region, so it can be expected that eq. 4.18
will have good optimality. Thus simultaneously optimizing eq. 4.17 - 4.19 leads to
the feature clustering indicator matrix $\mathbf{B}$.
By applying a similar approach to the coefficient matrix $\mathbf{C}$, optimizing eq. 4.13
- 4.15 is equivalent to optimizing:

\[ \max_{\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{A}^T\mathbf{A}\mathbf{C}^T\right) \tag{4.20} \]
\[ \max_{\mathbf{C}} \mathrm{tr}\left(\mathbf{C}\mathbf{A}^T\mathbf{\Gamma}_B\right) \tag{4.21} \]
\[ \min_{\mathbf{C}} \Big( \sum_{i}\left(\check{\mathbf{c}}_i\check{\mathbf{c}}_i^T\right)^2 + \sum_{i \ne j}\left(\check{\mathbf{c}}_i\check{\mathbf{c}}_j^T\right)^2 \Big) \tag{4.22} \]
\[ \text{s.t.} \quad 0 \le \mathbf{C} \le \mathbf{\Upsilon}_C, \]

where $\check{\mathbf{c}}_i$ denotes the $i$-th row of $\mathbf{C}$. By following the previous discussion on
$\mathbf{B}$, it can be shown that as long as the vertices in $\mathcal{G}(\mathbf{A}^T\mathbf{A})$ are clustered,
simultaneously optimizing eq. 4.20 - 4.22 leads to the item clustering indicator matrix
$\mathbf{C}$. □

4.1. A limitation of the NMF as a clustering method. As shown in the
proof of proposition 4.1, optimizing the NMF objective is equivalent to applying the
relaxed ratio association to the item graph $\mathcal{G}(\mathbf{A}^T\mathbf{A})$ and the feature graph
$\mathcal{G}(\mathbf{A}\mathbf{A}^T)$ simultaneously. Because in the NMF the cluster membership of each point is
determined directly by finding the largest projection on the axes of the decomposition
rank subspace ($K$ subspace) [1], the NMF can only offer good results if the data
points are linearly separable.

This is not the case with spectral clustering, where the memberships are
determined indirectly by applying k-means clustering to the resulting factors. This
additional step can sometimes find correct assignments even though the data points
are not linearly separable. Unfortunately, since the factors produced by the
NMF are nonnegative and point directly to the cluster centers [1], applying k-means
clustering to the factors will not change the clustering assignments.

The following examples show the limitation of the NMF in clustering linearly
inseparable data points. For comparison, spectral clustering is used. For
spectral clustering, we use the Ng et al. algorithm (NJW) [31], and for the NMF,
we use the Lee & Seung algorithm (NMFLS) [H] and the Kim & Park algorithm (NMFJK)
pi]. NJW and NMFLS are the standard algorithms for spectral clustering and the
NMF respectively, and NMFJK is an NMF algorithm that has a convergence guarantee.
Algorithm 2 describes the NJW algorithm, and algorithm 3 describes clustering using the
NMF. Note that we wrote the code for NJW and NMFLS ourselves, and use the code
from the authors' website^1 for NMFJK. To get the same treatment as in NJW, we
use the same kernel strategy for NMFLS and NMFJK. The adjustable parameter $\sigma$
is learned directly from the datasets, and the results are displayed in figures 4.1, 4.2,
and 4.3.

^1 http://www.cc.gatech.edu/~jingu/nmf/index.html
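Algorithm 2 Spectral clustering algorithm by Ng et al. [31] (NJW).

1. Input: Rectangular data matrix $\mathbf{A} \in \mathbb{R}_+^{M \times N}$ with $N$ data points, #cluster
$K$, and Gaussian kernel parameter $\sigma$.

2. Construct the symmetric affinity matrix $\hat{\mathbf{A}} \in \mathbb{R}_+^{N \times N}$ from $\mathbf{A}$ by using the
Gaussian kernel.

3. Normalize $\hat{\mathbf{A}}$ by $\hat{\mathbf{A}} \leftarrow \mathbf{D}^{-1/2}\hat{\mathbf{A}}\mathbf{D}^{-1/2}$, where $\mathbf{D}$ is a diagonal
matrix with $D_{ii} = \sum_j \hat{A}_{ij}$.

4. Compute the $K$ largest eigenvectors of $\hat{\mathbf{A}}$, and form $\mathbf{X} \in \mathbb{R}^{N \times K} =
[\mathbf{x}_1, \ldots, \mathbf{x}_K]$, where $\mathbf{x}_k$ is the $k$-th largest eigenvector of $\hat{\mathbf{A}}$.

5. Normalize every row of $\mathbf{X}$, i.e., $X_{ij} \leftarrow X_{ij} / (\sum_j X_{ij}^2)^{1/2}$.

6. Apply k-means clustering on the rows of $\mathbf{X}$ to obtain the clustering indicator
matrix $\hat{\mathbf{X}} \in \mathbb{R}^{N \times K}$.

Algorithm 3 Clustering by using the NMF.

1. Input: Rectangular data matrix $\mathbf{A} \in \mathbb{R}_+^{M \times N}$ with $N$ data points, #cluster
$K$, and Gaussian kernel parameter $\sigma$.

2. Construct the symmetric affinity matrix $\hat{\mathbf{A}} \in \mathbb{R}_+^{N \times N}$ from $\mathbf{A}$ by using the
Gaussian kernel.

3. Compute $\mathbf{B}$ and $\mathbf{C}$ by using an NMF algorithm (NMFLS or NMFJK) so that
$\hat{\mathbf{A}} \approx \mathbf{B}\mathbf{C}$.

4. Assume $\mathbf{C}$ is used; then the clustering assignment $x_n$ of data point $\mathbf{a}_n$ can be
computed by $x_n \leftarrow \arg\max_k c_{kn}$, $\forall n$.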
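
The steps of algorithm 3 can be sketched in Python/NumPy as follows. This is a minimal illustration, not the code used for the experiments: the kernel form $\exp(-\|\mathbf{a}_i - \mathbf{a}_j\|^2 / (2\sigma^2))$, the multiplicative update rule in the style of Lee & Seung, the iteration count, the small constant `eps`, and the toy data are all assumptions introduced here.

```python
import numpy as np

def gaussian_affinity(A, sigma):
    """Affinity between the columns (data points) of A.

    Assumed kernel form: exp(-||a_i - a_j||^2 / (2 * sigma^2)).
    """
    P = A.T
    sq = np.sum(P * P, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (P @ P.T), 0.0)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def nmf_multiplicative(S, K, iters=500, seed=0, eps=1e-12):
    """Multiplicative updates for S ~ B C under the Frobenius objective."""
    rng = np.random.default_rng(seed)
    B = rng.random((S.shape[0], K)) + eps
    C = rng.random((K, S.shape[1])) + eps
    for _ in range(iters):
        C *= (B.T @ S) / (B.T @ B @ C + eps)
        B *= (S @ C.T) / (B @ C @ C.T + eps)
    return B, C

def cluster_by_nmf(A, K, sigma):
    S = gaussian_affinity(A, sigma)      # step 2
    B, C = nmf_multiplicative(S, K)      # step 3
    labels = np.argmax(C, axis=0)        # step 4: x_n = argmax_k c_kn
    return labels, B, C

# Hypothetical toy input: two tight, well-separated groups of 2-D points (columns of A).
rng = np.random.default_rng(1)
group1 = rng.normal(loc=0.0, scale=0.05, size=(2, 10))
group2 = rng.normal(loc=3.0, scale=0.05, size=(2, 10))
A = np.abs(np.hstack([group1, group2]))
labels, B, C = cluster_by_nmf(A, K=2, sigma=0.5)
```

On data like this toy example, where the affinity matrix is close to block diagonal, one would expect each group to receive its own label; for linearly inseparable data the assignment can fail, which is the limitation discussed in this subsection.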



As shown in figures 4.1, 4.2, and 4.3, while spectral clustering correctly
finds the clustering assignments for all datasets, the NMFs can only compete with
spectral clustering on the last dataset, which is rather linearly separable. These
results are in accord with the proof of proposition 4.1 (which states that as long as the
vertices of the feature (item) graph are clustered, optimizing the NMF objective leads to
the feature (item) clustering indicator matrix). Thus, it seems that as a clustering
method, the NMF is more similar to k-means clustering or the support vector machine
(SVM), which also can only handle linearly separable datasets, than to the spectral
methods, even though both clustering using the NMF and the spectral methods are
based on matrix decomposition techniques. Accordingly, the clustering performance of
the NMF can probably be improved by using appropriate kernel methods, as in k-means
clustering and SVM.

Fig. 4.1. Clustering linearly inseparable datasets using NJW. Panels: (a) $\sigma = 0.05$, (b) $\sigma = 0.1$, (c) $\sigma = 0.1$, (d) $\sigma = 0.2$.

Fig. 4.2. Clustering linearly inseparable datasets using NMFLS. Panels: (a) $\sigma = 0.3$, (b) $\sigma = 0.5$, (c) $\sigma = 0.4$, (d) $\sigma = 0.5$.

Fig. 4.3. Clustering linearly inseparable datasets using NMFJK. Panels: (a) $\sigma = 0.8$, (b) $\sigma = 0.6$, (c) $\sigma = 0.4$, (d) $\sigma = 0.4$.

4.2. Experimental results. The experiments are conducted to evaluate the
performance of the NMF as a clustering method. All algorithms are developed in
GNU Octave on a Linux platform, using a notebook with a 1.86 GHz Intel processor
and 2 GB RAM. The Reuters-21578 document corpus^2, a standard dataset for testing
learning algorithms and other text-based processing methods, is used for this purpose.
This dataset contains 21578 documents (divided into 22 files, each containing
1000 documents except the last, which contains 578 documents) with 135 manually
created topics; each document is assigned to one or more topics based on its content.
The dataset is available in SGML and XML formats; we use the XML version. We use
all but the 18th file, because this file is invalid in both its SGML and XML versions.
We use only documents that belong to exactly one class (we use "classes" to
refer to the original grouping, and "clusters" to refer to the groups produced by
the clustering algorithms).

Further, we remove the common English stop words^3, stem the remaining words
using the Porter stemmer [32], and then remove words that belong to only one document.
We also normalize the term-by-document matrix $\mathbf{A}$ by $\mathbf{A} \leftarrow \mathbf{A}\mathbf{D}^{-1/2}$, where
$\mathbf{D} = \mathrm{diag}(\mathbf{A}^T\mathbf{A}\mathbf{e})$, as suggested by Xu et al. [1]. We form test datasets by combining
the top 2, 4, 6, 8, 10, and 12 classes from the corpus. Table 4.1 summarizes the statistics
of these test datasets, where #doc, #word, %nnz, max, and min refer to the number
of documents, the number of words, the percentage of nonzero entries, the maximum
cluster size, and the minimum cluster size respectively. Table 4.2 gives the sizes (#doc)
of these top 12 classes.

^2 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
^3 http://snowball.tartarus.org/algorithms/english/stop.txt

Table 4.1
Statistics of the test datasets.

  The data     #doc    #word    %nnz     max     min
  Reuters2     6090     8547    0.363    3874    2216
  Reuters4     6797     9900    0.353    3874     333
  Reuters6     7354    10319    0.347    3874     269
  Reuters8     7644    10596    0.340    3874     144
  Reuters10    7887    10930    0.336    3874     114
  Reuters12    8052    11172    0.333    3874      75

Table 4.2
Sizes of the top 12 classes.

  class      1      2      3      4      5      6
  #doc    3874   2216    374    333    288    269

  class      7      8      9     10     11     12
  #doc     146    144    129    114     90     75

As shown in table 4.1, a rectangular word-by-document matrix $\mathbf{A}$, where $a_{ij}$
denotes the (weighted) frequency of word $i$ in document $j$, can be induced from each
dataset (Reuters2, ..., Reuters12). Thus we have two options: either directly
applying co-clustering on the bipartite graph $\mathcal{G}(\mathbf{A})$ to find the word and
document clusterings simultaneously, or first transforming $\mathcal{G}(\mathbf{A})$ into the corresponding
unipartite graph $\mathcal{G}(\Phi(\mathbf{A}^T, \mathbf{A}))$ using a kernel function $\Phi$, and then applying the
previously discussed clustering methods (algorithms 2 and 3) to find the document
clustering. Employing kernel methods is unheard of in document clustering research
(probably because of the sizes of the datasets), so the co-clustering style is employed
instead. This is actually the most common way of using the NMF for clustering purposes

[DIlllllSllllTlIISlllIllllllllMlllHllMllSnilMlIM].

For comparison, we employ spectral co-clustering on $\mathcal{G}(\mathbf{A})$, which is
computed by finding the first $K$ singular vectors of $\mathbf{A}$. Because only reference classes
for documents are available, we evaluate only the document clustering performance.
The following theorem gives theoretical support for the use of the SVD in
multiclass spectral co-clustering (a more detailed discussion on this topic can be found in
ref. [3S]), and algorithms 4 and 5 summarize the document clustering using the SVD
and the NMF respectively.



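Theorem 4.2. The optimal value of the following problem:

\[ \max_{\mathbf{X}^T\mathbf{X} = \mathbf{Y}^T\mathbf{Y} = \mathbf{I}_K} \mathrm{tr}\left(\mathbf{X}^T\mathbf{R}\mathbf{Y}\right), \tag{4.23} \]

is equal to $\sum_{k=1}^{K}\sigma_k$ if

\[ \mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_K]\mathbf{Q}, \quad \text{and} \quad \mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_K]\mathbf{Q}, \]

where $\mathbf{R} \in \mathbb{C}^{M \times N}$ denotes a full rank rectangular complex matrix with singular values
$\sigma_1 \ge \ldots \ge \sigma_{\min(M,N)} > 0$, $0 < K \le \min(M, N)$, $\mathbf{X} \in \mathbb{C}^{M \times K}$ and
$\mathbf{Y} \in \mathbb{C}^{N \times K}$ denote column orthogonal matrices, $\mathbf{x}_k$ and $\mathbf{y}_k$ ($k \in [1, K]$)
respectively denote the $k$-th left and right singular vectors corresponding to $\sigma_k$, and
$\mathbf{Q} \in \mathbb{C}^{K \times K}$ denotes an arbitrary unitary matrix.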

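Proof. Eq. 4.23 can be rewritten as:

\[ \max_{\mathbf{X}^T\mathbf{X} = \mathbf{Y}^T\mathbf{Y} = \mathbf{I}_K} \frac{1}{2}\,\mathrm{tr}\left( \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix}^T \underbrace{\begin{bmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^T & \mathbf{0} \end{bmatrix}}_{\mathbf{\Psi}} \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} \right). \]

Since $\mathbf{\Psi}$ is a full rank Hermitian matrix, by the Ky Fan theorem (shown in theorem
4.3 below), the global optimum solution is given by the first $K$ eigenvectors of $\mathbf{\Psi}$:

\[ \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1, \ldots, \mathbf{x}_K \\ \mathbf{y}_1, \ldots, \mathbf{y}_K \end{bmatrix} \mathbf{Q}. \]

Therefore,

\[ \begin{bmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^T & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{bmatrix} = \lambda_k \begin{bmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{bmatrix}, \]

where $k \in [1, K]$ and $\lambda_k$ denotes the $k$-th eigenvalue of $\mathbf{\Psi}$. Then,

\[ \mathbf{R}\mathbf{y}_k = \lambda_k\mathbf{x}_k, \quad \text{and} \quad \mathbf{R}^T\mathbf{x}_k = \lambda_k\mathbf{y}_k, \]

where $\mathbf{x}_k$ and $\mathbf{y}_k$ denote the left and right singular vectors associated with the singular
value $\lambda_k\,(= \sigma_k)$ of $\mathbf{R}$. □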

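Theorem 4.3 (Ky Fan [36], [37]). The optimal value of the following problem:

\[ \max_{\mathbf{X}^T\mathbf{X} = \mathbf{I}_K} \mathrm{tr}\left(\mathbf{X}^T\mathbf{H}\mathbf{X}\right) \]

is equal to $\sum_{k=1}^{K}\lambda_k$ if

\[ \mathbf{X} = [\mathbf{u}_1, \ldots, \mathbf{u}_K]\mathbf{Q}, \]

where $\mathbf{H} \in \mathbb{C}^{N \times N}$ denotes a full rank Hermitian matrix with eigenvalues
$\lambda_1 \ge \ldots \ge \lambda_N \in \mathbb{R}$, $1 \le K \le N$, $\mathbf{X} \in \mathbb{C}^{N \times K}$ denotes a column orthogonal
matrix, $\mathbf{I}_K$ denotes a $K \times K$ identity matrix, $\mathbf{u}_k \in \mathbb{C}^{N}$ denotes the $k$-th eigenvector
corresponding to $\lambda_k$, and $\mathbf{Q} \in \mathbb{C}^{K \times K}$ denotes an arbitrary unitary matrix.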

Note that, even though theoretically $\lambda_k$ can be chosen so that $\lambda_k = \sigma_k\ \forall k$,
numerically $\lambda_k$ can be negative. Also, numerically, $\mathbf{X}$ and $\mathbf{Y}$ constructed using
eigenvectors of $\mathbf{\Psi}$ can differ from those constructed using singular vectors of $\mathbf{R}$.
Therefore one should always use the SVD for computing $\mathbf{X}$ and $\mathbf{Y}$. For convenience, we
assume $\mathbf{R}$ to be of full rank; a similar result can be derived for non-full-rank $\mathbf{R}$.
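
Theorem 4.2 can be checked numerically on a small example. The Python/NumPy sketch below uses a hypothetical real-valued random matrix (so unitary matrices reduce to orthogonal ones) and compares the trace attained by the first $K$ singular vectors with the sum of the $K$ largest singular values; it also evaluates the trace for a rotated solution and for a random pair of column-orthogonal matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 4, 2
R = rng.standard_normal((M, N))   # hypothetical real-valued test matrix

U, s, Vt = np.linalg.svd(R, full_matrices=False)
X = U[:, :K]        # first K left singular vectors
Y = Vt[:K, :].T     # first K right singular vectors
optimum = np.trace(X.T @ R @ Y)

# Multiplying both factors by the same orthogonal Q leaves the value unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
rotated = np.trace((X @ Q).T @ R @ (Y @ Q))

# A random pair of column-orthogonal matrices cannot exceed the optimum.
Xr, _ = np.linalg.qr(rng.standard_normal((M, K)))
Yr, _ = np.linalg.qr(rng.standard_normal((N, K)))
random_value = np.trace(Xr.T @ R @ Yr)
```

Here `optimum` and `rotated` both equal `s[:K].sum()` up to rounding, and `random_value` does not exceed it.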

Theorem 4.2 gives the theoretical support for directly applying graph cuts to the
bipartite graph $\mathcal{G}(\mathbf{A})$ to get simultaneous row and column clustering, also known as
(multiclass) spectral co-clustering. The following gives the objective of the multiclass
spectral co-clustering:

\[ \max_{\mathbf{X}^T\mathbf{X} = \mathbf{Y}^T\mathbf{Y} = \mathbf{I}_K} \mathrm{tr}\left(\mathbf{X}^T\mathbf{A}\mathbf{Y}\right), \]

where $\mathbf{X} \in \mathbb{R}_+^{M \times K}$ and $\mathbf{Y} \in \mathbb{R}_+^{N \times K}$ denote the row and column clustering indicator
matrices respectively. By relaxing the nonnegativity constraints, $\mathbf{X}$ and $\mathbf{Y}$ can be
found by computing the first $K$ left and right singular vectors of $\mathbf{A}$.

Algorithm 4 Document clustering using the SVD.

1. Input: Rectangular word-by-document matrix $\mathbf{A} \in \mathbb{R}_+^{M \times N}$, and #cluster
$K$.

2. Normalize $\mathbf{A}$ by: $\mathbf{A} \leftarrow \mathbf{A}\mathbf{D}^{-1/2}$, where $\mathbf{D} = \mathrm{diag}(\mathbf{A}^T\mathbf{A}\mathbf{e})$.

3. Compute the first $K$ right singular vectors of $\mathbf{A}$, and form $\mathbf{V} \in \mathbb{R}^{N \times K} =
[\mathbf{v}_1, \ldots, \mathbf{v}_K]$, where $\mathbf{v}_k$ is the $k$-th right singular vector of $\mathbf{A}$.

4. Apply k-means clustering on the rows of $\mathbf{V}$ to obtain the document clustering
indicator matrix $\hat{\mathbf{V}} \in \mathbb{R}^{N \times K}$.

Algorithm 5 Document clustering using the NMF.

1. Input: Rectangular word-by-document matrix $\mathbf{A} \in \mathbb{R}_+^{M \times N}$, and #cluster
$K$.

2. Normalize $\mathbf{A}$ by: $\mathbf{A} \leftarrow \mathbf{A}\mathbf{D}^{-1/2}$, where $\mathbf{D} = \mathrm{diag}(\mathbf{A}^T\mathbf{A}\mathbf{e})$.

3. Compute $\mathbf{C}$ by using an NMF algorithm (NMFLS or NMFJK) so that
$\mathbf{A} \approx \mathbf{B}\mathbf{C}$.

4. Compute the clustering assignment of the $n$-th document by:
$x_n = \arg\max_k c_{kn}$, $\forall n$.

There are some standard metrics for evaluating clustering quality. The most
commonly used metrics are mutual information, entropy, and purity. We use these
metrics together with an additional metric, Fmeasure. In the following, the definitions
of these metrics are outlined.

Mutual information (MI) measures the dependency between the clusters produced by
the algorithms and the reference classes. The higher the MI, the more related the
clusters are to the classes, and therefore the better the clustering. MI has been shown
to be a superior measure to purity and entropy [38] because it is tolerant to
the difference between #cluster and #class. MI is defined by the following formula:

\[ \mathrm{MI} = \sum_{r=1}^{R}\sum_{s=1}^{S} p(r, s) \log_2 \frac{p(r, s)}{p(r)\,p(s)}, \]

where $r$ and $s$ denote the $r$-th cluster and $s$-th class respectively, $R$ and $S$ denote the
numbers of clusters and classes, $p(r, s)$ denotes the joint probability distribution function
of the clusters and the classes, and $p(r)$ and $p(s)$ denote the marginal probability
distribution functions of the clusters and the classes respectively. Note that because of
inconsistency in the formulation of normalized MI (a more commonly used metric) in the
literature, we use MI instead. Accordingly, MI values are comparable only for the same
dataset.
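
A minimal pure-Python sketch of the MI computation is given below. It assumes that the probabilities are estimated by relative frequencies of the cluster and class labels and that the logarithm is taken in base 2; the function name and the label lists in the usage are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(clusters, classes):
    """MI = sum_{r,s} p(r,s) * log2(p(r,s) / (p(r) * p(s))).

    Probabilities are estimated by relative frequencies (an assumption).
    """
    n = len(clusters)
    joint = Counter(zip(clusters, classes))   # counts of (cluster, class) pairs
    count_r = Counter(clusters)
    count_s = Counter(classes)
    mi = 0.0
    for (r, s), c in joint.items():
        # p(r,s) / (p(r) p(s)) = c * n / (count_r * count_s)
        mi += (c / n) * log2(c * n / (count_r[r] * count_s[s]))
    return mi

perfect = mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])
independent = mutual_information([0, 0, 1, 1], ["a", "b", "a", "b"])
```

For two balanced classes, a clustering that matches the classes exactly gives an MI of 1 bit, while a clustering that is independent of the classes gives 0.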

Entropy addresses the composition of classes in a cluster. It measures the uncertainty
in the cluster, thus the lower the entropy, the better the clustering. Unlike MI,
if there is a discrepancy between #cluster and #class, entropy is not very indicative
of the clustering quality. Entropy is defined by the following:

\[ \mathrm{entropy} = -\frac{1}{N \log_2 S}\sum_{r=1}^{R}\sum_{s=1}^{S} c_{rs} \log_2 \frac{c_{rs}}{c_r}, \]

where $N$ is the number of samples (#doc for document clustering), $c_{rs}$ denotes the
number of samples in the $r$-th cluster that belong to the $s$-th class, and $c_r$ denotes the
size of the $r$-th cluster.
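
The entropy metric can be sketched in pure Python in the same way. The sketch assumes the normalized form $-\frac{1}{N \log_2 S}\sum_{r,s} c_{rs}\log_2(c_{rs}/c_r)$, with $S$ taken as the number of distinct class labels present in the input; the function name and example labels are hypothetical.

```python
from collections import Counter
from math import log2

def clustering_entropy(clusters, classes):
    """Entropy = -(1 / (N * log2(S))) * sum_{r,s} c_rs * log2(c_rs / c_r)."""
    n = len(clusters)
    n_classes = len(set(classes))     # S, assumed to be the number of distinct labels
    if n_classes < 2:
        return 0.0                    # a single class carries no uncertainty
    c_rs = Counter(zip(clusters, classes))
    c_r = Counter(clusters)
    total = sum(c * log2(c / c_r[r]) for (r, _), c in c_rs.items())
    return -total / (n * log2(n_classes))

pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

Clusters that each contain a single class give an entropy of 0, and clusters in which two classes are evenly mixed give the maximum value of 1.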

Purity is the most commonly used metric. It measures the percentage of the
dominant class in a cluster, so the higher the better. Like entropy, purity is
sensitive to the discrepancy between #cluster and #class. Purity is defined by:

\[ \mathrm{purity} = \frac{1}{N}\sum_{r=1}^{R} \max_{s} c_{rs}. \]

Fmeasure combines two concepts from IR: recall and precision. Recall measures
the proportion of retrieved relevant documents to all relevant documents, and
precision measures the proportion of retrieved relevant documents to all retrieved
documents. In the context of assessing clustering quality, Fmeasure is defined by

\[ \mathrm{Fmeasure} = \frac{1}{R}\sum_{r=1}^{R} F_r, \qquad F_r = 2\,\frac{\mathrm{precision}_r \times \mathrm{recall}_r}{\mathrm{precision}_r + \mathrm{recall}_r}, \]

where $\mathrm{precision}_r$ and $\mathrm{recall}_r$ denote the precision and recall of the $r$-th cluster.
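
Purity and Fmeasure can be sketched together in pure Python. One detail is an assumption made here for illustration: the precision and recall of a cluster are computed with respect to its dominant class (the class with the most members in that cluster), since the text does not spell out the matching between clusters and classes. All function names and the example labels are hypothetical.

```python
from collections import Counter, defaultdict

def _cluster_class_counts(clusters, classes):
    """Return {cluster: Counter(class -> c_rs)}."""
    table = defaultdict(Counter)
    for r, s in zip(clusters, classes):
        table[r][s] += 1
    return table

def purity(clusters, classes):
    """purity = (1/N) * sum_r max_s c_rs."""
    table = _cluster_class_counts(clusters, classes)
    return sum(max(row.values()) for row in table.values()) / len(clusters)

def f_measure(clusters, classes):
    """Average of F_r over clusters, matching each cluster to its dominant class."""
    table = _cluster_class_counts(clusters, classes)
    class_sizes = Counter(classes)
    scores = []
    for row in table.values():
        dominant, hits = row.most_common(1)[0]
        precision = hits / sum(row.values())     # share of the cluster in the dominant class
        recall = hits / class_sizes[dominant]    # share of that class captured by the cluster
        scores.append(2.0 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)

clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
p = purity(clusters, classes)
f = f_measure(clusters, classes)
```

In this toy example the first cluster holds two "a" documents and one "b" document, and the second holds three "b" documents, so the purity is 5/6 and the per-cluster F values are 0.8 and 6/7.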

Clustering results of the SVD and the NMFs are shown in tables 4.3 to 4.6. Time
comparisons are given in table 4.7, where the times for the SVD are the sum of the
SVD computational times and the times for performing k-means to obtain the clustering
assignments, and the times for the NMF are simply the times for performing the NMF
on the data matrices. Because we use a built-in SVD function that is written in C and
highly optimized, the computational times of the SVD are not really comparable to
the computational times of the NMF algorithms, which are written as Matlab/Octave
scripts.

As shown in tables 4.3 to 4.6, in general NMFJK performs as well as the SVD on
all the metrics, with NMFJK tending to be better for datasets with smaller #clusters
and the SVD for datasets with bigger #clusters. Unfortunately, NMFLS, which is the
most popular NMF algorithm, seems to be able to give only moderate results. The
convergence guarantee of NMFJK probably plays some role here, as converged
algorithms can usually approximate the original matrices better than algorithms without
a convergence guarantee [T71 [23].

The computational times of NMFJK seem promising, as it is faster than
NMFLS for all datasets. Note that since NMFLS and NMFJK are written as
Matlab/Octave scripts, improving the computational performance of these algorithms is
highly possible. According to Albright et al. [40], some highly optimized NMF
algorithms can be faster than SVD algorithms.
Table 4.3
Average mutual information over 10 trials.

  Data         SVD             NMFLS           NMFJK
  Reuters2     0.4951610991    0.4039195065    0.4825151487
  Reuters4     0.7345796343    0.6287861424    0.7482158671
  Reuters6     0.8367879438    0.7945867871    0.9763402437
  Reuters8     1.0342492298    0.9228548694    1.0110485952
  Reuters10    1.1754008483    1.0415397095    1.1588735544
  Reuters12    1.0812313058    1.1325663319    1.2069251441

Table 4.4
Average entropy over 10 trials.

Data        SVD            NMFLS          NMFJK
Reuters2    0.4506918576   0.5419334502   0.463337808
Reuters4    0.3491263843   0.4020231303   0.3423082679
Reuters6    0.3675835184   0.3839091543   0.3135973194
Reuters8    0.3185428072   0.3556742607   0.3262763521
Reuters10   0.2957115633   0.3360077814   0.3006867745
Reuters12   0.3338525186   0.3195329752   0.2987911092

Table 4.5
Average purity over 10 trials.

Data        SVD            NMFLS          NMFJK
Reuters2    0.8623973727   0.821543514    0.8688505747
Reuters4    0.8394880094   0.7941739003   0.8234515227
Reuters6    0.68180582     0.7451047049   0.8042697852
Reuters8    0.8178963893   0.7490580848   0.7780612245
Reuters10   0.786103715    0.7312032458   0.7769747686
Reuters12   0.6838052658   0.7387729757   0.7663686041

Table 4.6
Average Fmeasure over 10 trials.

Data        SVD            NMFLS          NMFJK
Reuters2    0.8595171797   0.8190358778   0.865279454
Reuters4    0.6255202581   0.5615436865   0.6960413891
Reuters6    0.6487551603   0.4622471694   0.6488871315
Reuters8    0.5043941779   0.4040827621   0.4680952898
Reuters10   0.516367264    0.3800132312   0.4842865587
Reuters12   0.4437491506   0.3567059141   0.4333021978
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Table 4.7 

Average computational times over 10 trials (seconds).

Data        SVD      NMFLS    NMFJK
Reuters2    4.675    77.27    65.45
Reuters4    6.315    108.8    86.32
Reuters6    14.01    134.0    105.1
Reuters8    18.17    158.4    128.3
Reuters10   19.98    834.7    452.2
Reuters12   21.23    1249     775.8


5. LSI aspect of the NMF. This section is the second part of this paper.
Here, we first describe the LSI aspect of the NMF by showing its capability in solving
synonymy and polysemy problems in some synthetic datasets whose semantic
structures allow the problems to be revealed. We then evaluate this aspect more
extensively by comparing the results with those of the standard LSI method, the
(truncated) SVD, using real datasets.

5.1. Synonymy problems. Synonyms are different words with the same or almost
the same meaning; for example, {university, college, institute}, {female, girl, woman},
and {book, novel, biography} are each a set of synonyms. To improve the recall
and precision of an IR system, the system is expected to recognize synonyms.
This task is usually done by approximating the original word-by-document
matrix with its truncated SVD version [9], [41].

The synthetic dataset in table 5.1 (taken from ref. [42]) shows the synonymy
problems, in which Mark Twain & Samuel Clemens refer to the same person, and
purple & colour are closely related. As shown, Mark Twain & Samuel Clemens are not
recognized as the same person, and similarly purple & colour are not recognized
as related. Accordingly, if a query $\mathbf{q}$ containing {mark, twain} ($\mathbf{q}^T = [1\ 1\ 0\ 0\ 0\ 0]$)
is made to the original matrix $\mathbf{A}$, then $\mathbf{q}^T \mathbf{A} = [30, 0, 20, 0, 0]$. So only Doc1 and
Doc3 are retrieved, and Doc2 is lost. Similarly, if a query containing {colour} is made,
then only Doc4 will be retrieved, and Doc5 will be lost (note that even though purple
and colour can have different meanings, according to this example they are highly
related, thus any query containing either one of them is expected to retrieve both Doc4
and Doc5).
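The query arithmetic above can be reproduced with a short sketch. The matrix is copied from table 5.1 with empty cells read as zeros; the function name `query_scores` and the use of plain Python lists are choices made here for illustration and are not part of the original experiments.

```python
# Word-by-document counts from table 5.1 (rows: words, columns: Doc1..Doc5).
WORDS = ["mark", "twain", "samuel", "clemens", "purple", "colour"]
A = [
    [15, 0, 0, 0, 0],    # mark
    [15, 0, 20, 0, 0],   # twain
    [0, 10, 5, 0, 0],    # samuel
    [0, 20, 10, 0, 0],   # clemens
    [0, 0, 0, 20, 10],   # purple
    [0, 0, 0, 15, 0],    # colour
]

def query_scores(terms, words=WORDS, matrix=A):
    """Return q^T A for the binary query vector q built from `terms`."""
    q = [1 if w in terms else 0 for w in words]
    n_docs = len(matrix[0])
    return [sum(q[i] * matrix[i][j] for i in range(len(words)))
            for j in range(n_docs)]

print(query_scores({"mark", "twain"}))  # [30, 0, 20, 0, 0]
print(query_scores({"colour"}))         # [0, 0, 0, 15, 0]
```

A zero score means the document shares no query word, which is exactly why Doc2 and Doc5 are missed on the raw matrix.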

The synonymy problem can be resolved using the LSI technique as long as there is a
path that chains the synonyms together and the path is short enough [43]. For example,
in table 5.1 mark & twain are connected to samuel & clemens through Doc3. So
there is a path that connects them, and the distance happens to be short. Thus
we can expect that LSI using the SVD will be able to reveal this hidden relationship.



A. Mirzal 



Table 5.1
Dataset for describing synonymy problems.

Word      Doc1   Doc2   Doc3   Doc4   Doc5
mark      15     0      0      0      0
twain     15     0      20     0      0
samuel    0      10     5      0      0
clemens   0      20     10     0      0
purple    0      0      0      20     10
colour    0      0      0      15     0

Table 5.2
LSI using the SVD for detecting synonyms.

Word      Doc1   Doc2   Doc3   Doc4   Doc5
mark      3.7    3.5    5.5    -ε     -ε
twain     11     10     16     -ε     -ε
samuel    4.1    3.9    6.1    -ε     -ε
clemens   8.3    7.8    12     -ε     -ε
purple    -ε     -ε     -ε     21     7.1
colour    -ε     -ε     -ε     13     4.5






Table 5.3
LSI using the NMF (NMFLS) for detecting synonyms.

Word      Doc1   Doc2   Doc3   Doc4   Doc5
mark      3.72   3.50   5.45   0      0
twain     11.0   10.4   16.2   0      0
samuel    4.15   3.90   6.08   0      0
clemens   8.29   7.79   12.1   0      0
purple    0      0      0      21.0   7.08
colour    0      0      0      13.5   4.55






Table 5.4
LSI using the NMF (NMFJK) for detecting synonyms.

Word      Doc1   Doc2   Doc3   Doc4   Doc5
mark      3.14   2.95   4.60   0      0
twain     9.27   8.71   13.6   0      0
samuel    3.50   3.29   5.13   0      0
clemens   7.00   6.58   10.3   0      0
purple    0      0      0      17.3   5.83
colour    0      0      0      11.1   3.74






Similarly, colour & purple are connected through Doc4, thus LSI is also expected
to be able to reveal this relationship. Table 5.2 shows the result of the rank-2 matrix
approximation of the original matrix in table 5.1 by using the SVD, where ε denotes
a small positive number. Note that the rank is chosen based on the number of reference
classes, i.e., author names ({mark, twain, samuel, clemens}) and colour-related terms
({purple, colour}). As shown, Doc1, Doc2, and Doc3 now all index mark, twain,
samuel, and clemens. Thus any query containing any of these words will correctly
retrieve the corresponding relevant documents. Similarly, Doc4 and Doc5
now both index purple and colour, so any query containing at least one of these words
will correctly retrieve the corresponding relevant documents.
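A minimal NumPy sketch of this rank-2 approximation follows. It assumes the counts of table 5.1 with empty cells read as zeros; the helper name `truncated_svd` is hypothetical, and `numpy.linalg.svd` is used here as a convenient stand-in for whatever SVD routine was used in the original experiments.

```python
import numpy as np

# Rows: mark, twain, samuel, clemens, purple, colour; columns: Doc1..Doc5.
A = np.array([
    [15, 0, 0, 0, 0],
    [15, 0, 20, 0, 0],
    [0, 10, 5, 0, 0],
    [0, 20, 10, 0, 0],
    [0, 0, 0, 20, 10],
    [0, 0, 0, 15, 0],
], dtype=float)

def truncated_svd(matrix, k):
    """Rank-k approximation built from the k largest singular triplets."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

A2 = truncated_svd(A, 2)
q = np.array([1, 1, 0, 0, 0, 0], dtype=float)  # query {mark, twain}
scores = q @ A2

print(np.round(A2, 1))      # compare with table 5.2
print(np.round(scores, 1))  # Doc2 now receives a clearly positive score
```

Because the matrix is block structured, the two retained singular triplets each cover one block, so the cross-block entries are zero up to rounding error.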

Now we will apply the NMF to the data matrix in table 5.1 and see whether this
technique can solve the synonymy problems. Tables 5.3 and 5.4 show rank-2 matrix
approximations using NMFLS and NMFJK respectively. As shown, both algorithms
correctly index the synonyms, and thus the NMF can also be used to solve the
synonymy problems in this dataset.

5.2. Polysemy problems. The LSI technique is also expected to be able to solve
the polysemy problem, in which a word has multiple unrelated meanings. By using a
synthetic dataset, we will describe how the standard LSI method and the NMF solve this
problem, given that the polyseme is present in unrelated documents. Table 5.5 gives an
example of a polyseme in which bank can refer either to a financial institution or to the
area near a river. By inspection, it is clear that the dataset contains two different topics,
financial and river, with {Doc1, Doc3, Doc5} & {money, bank, interest} in the first topic
and {Doc2, Doc4, Doc6} & {bed, river, bank} in the second topic. Note that the
dataset is well-conditioned for describing the polysemy problem, as bank is present in
unrelated documents.

If a query $\mathbf{q}_1^T = [1\ 0\ 0\ 1\ 0]$ containing {money, bank} is made to the original
matrix $\mathbf{A}$ in table 5.5, then $\mathbf{q}_1^T \mathbf{A} = [2\ 1\ 2\ 1\ 1\ 1]$. Doc1 and Doc3 score 2, while Doc5
scores only 1, the same as the river documents. So only Doc1 and Doc3 are
recognized as relevant, and Doc5 is not. Similarly, if a
query $\mathbf{q}_2^T = [0\ 0\ 1\ 1\ 0]$ containing {river, bank} is made, then $\mathbf{q}_2^T \mathbf{A} = [1\ 2\ 1\ 2\ 1\ 1]$;
only Doc2 and Doc4 are recognized as relevant, and Doc6 is not.

Table 5.6 shows the rank-2 SVD approximation of the original matrix. If the same
queries are made to the approximate matrix $\mathbf{A}$ in table 5.6, then $\mathbf{q}_1^T \mathbf{A}$ = [1.86726 1.00346 1.86726
1.00346 1.40251 0.91744] and $\mathbf{q}_2^T \mathbf{A}$ = [1.00346 1.86726 1.00346 1.86726 0.91744
1.40251]. Each relevant document now scores higher than every irrelevant one.
Therefore all relevant documents can be correctly retrieved, so LSI using
the SVD can solve the polysemy problem in this dataset.
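The scores quoted above can be checked with a short NumPy sketch. The matrix is taken from table 5.5 with empty cells read as zeros; the variable names are illustrative, and `numpy.linalg.svd` is assumed as the SVD routine.

```python
import numpy as np

# Rows: money, bed, river, bank, interest; columns: Doc1..Doc6.
A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0],
], dtype=float)

q1 = np.array([1, 0, 0, 1, 0], dtype=float)  # {money, bank}
q2 = np.array([0, 0, 1, 1, 0], dtype=float)  # {river, bank}

# Rank-2 approximation from the two largest singular triplets.
u, s, vt = np.linalg.svd(A, full_matrices=False)
A_rank2 = (u[:, :2] * s[:2]) @ vt[:2, :]

raw_scores_q1 = q1 @ A
lsi_scores_q1 = q1 @ A_rank2
lsi_scores_q2 = q2 @ A_rank2

print(raw_scores_q1)
print(np.round(lsi_scores_q1, 5))
print(np.round(lsi_scores_q2, 5))
```

On the raw matrix Doc5 ties with the river documents, whereas after the rank-2 approximation the three financial documents are separated from the three river documents for the first query, and vice versa for the second.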

Now we will see whether the NMF can also solve the problem. Tables 5.7 and 5.8
show rank-2 NMF approximations using NMFLS and NMFJK respectively. Let $\mathbf{A}_1$
and $\mathbf{A}_2$ be the matrices in tables 5.7 and 5.8 respectively. If the same queries are made
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Table 5.5 

Dataset for describing polysemy problem.

Word       Doc1   Doc2   Doc3   Doc4   Doc5   Doc6
money      1      0      1      0      0      0
bed        0      1      0      1      0      1
river      0      1      0      1      0      0
bank       1      1      1      1      1      1
interest   1      0      1      0      1      0

Table 5.6
LSI using the SVD for detecting polyseme.

Word       Doc1        Doc2        Doc3        Doc4        Doc5       Doc6
money      0.80882     -0.054983   0.80882     -0.054983   0.547139   0.062068
bed        -0.023949   1.08239     -0.023949   1.08239     0.117052   0.738319
river      -0.054983   0.80882     -0.054983   0.80882     0.062068   0.547139
bank       1.05844     1.05844     1.05844     1.05844     0.855371   0.855371
interest   1.08239     -0.023949   1.08239     -0.023949   0.738319   0.117052


Table 5.7
LSI using the NMF (NMFLS) for detecting polyseme.

Word       Doc1       Doc2       Doc3       Doc4       Doc5       Doc6
money      0.801054   0.013549   0.800924   0.013018   0.558518   0.082122
bed        0.013619   1.082496   0.014032   1.082806   0.098272   0.748645
river      0.010063   0.804439   0.01037    0.80467    0.072989   0.556338
bank       1.067149   1.063328   1.067377   1.062928   0.829791   0.831112
interest   1.080788   0.018281   1.080612   0.017564   0.753557   0.110799








Table 5.8
LSI using the NMF (NMFJK) for detecting polyseme.

Word       Doc1       Doc2       Doc3       Doc4       Doc5       Doc6
money      0.63733    0.040462   0.63733    0.040462   0.441459   0.098975
bed        0.054469   0.858063   0.054469   0.858063   0.133246   0.594351
river      0.040458   0.637333   0.040458   0.637333   0.09897    0.441459
bank       0.877087   0.877088   0.877087   0.877088   0.699338   0.699339
interest   0.85806    0.054476   0.85806    0.054476   0.594352   0.133253



to these matrices, then $\mathbf{q}_1^T \mathbf{A}_1$ = [1.8682 1.07688 1.8683 1.07595 1.38831 0.91323],
$\mathbf{q}_1^T \mathbf{A}_2$ = [1.51442 0.91755 1.51442 0.91755 1.1408 0.79831], $\mathbf{q}_2^T \mathbf{A}_1$ = [1.07721
1.86777 1.07775 1.8676 0.90278 1.38745], and $\mathbf{q}_2^T \mathbf{A}_2$ = [0.91754 1.51442 0.91754
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Table 5.9 
The standard text collections in LSI.

          Medline   Cranfield   CISI      ADI
#Doc      1033      1398        1460      82
#Word     12011     6551        9080      1215
%NNZ      0.4567    0.85674     0.51701   2.1479
#Query    30        225         35        35



1.51442 0.79831 1.1408]. Accordingly, the polysemy problem can also be solved by 
using the NMF in this dataset. 

5.3. Experimental results. We will now evaluate the LSI aspect of the NMF by
using real datasets, and compare the results with those of the SVD. Table 5.9
summarizes the datasets* used in the experiments, where #Doc, #Word, %NNZ, and
#Query denote the number of documents, the number of unique words, the percentage of
nonzero entries, and the number of predefined queries made to the corresponding
word-by-document matrix respectively. These datasets are standard text collections
that have been used extensively in LSI research.

Each of the text collections comprises three important files. The first file
contains abstracts of the documents, each indexed by a unique identifier; the
second file contains the list of queries, each with a unique identifier; and the third
file contains a dictionary that maps every query to its manually assigned relevant
documents.

The first file is used to construct the word-by-document matrix
$\mathbf{A} \in \mathbb{R}_{+}^{M \times N}$. To extract the unique words, the stop words and words shorter
than two characters are removed. But we do not employ any stemming and do not
remove words that belong to only one document as in section 4.2. The reasons are
that stemming does not seem to be popular in LSI research, and that removing
words unique to a document can potentially reduce recall, since queries may
contain these words. After $\mathbf{A}$ is constructed, we further adjust the entry
weights by using a logarithmic scale, i.e., $A_{ij} \leftarrow \log(A_{ij} + 1)$, but do not normalize
the columns of the matrix. This is because, based on our preliminary experiments, the
logarithmic scale performs better than the simple frequency of word occurrences, and
normalization has a negative effect on the retrieval performance of both the SVD
and the NMF for all text collections.
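The logarithmic weighting step can be sketched as follows. The function name `log_weight` and the toy counts are illustrative, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def log_weight(counts):
    """Apply A_ij <- log(A_ij + 1) entrywise (natural logarithm assumed)."""
    return [[math.log1p(c) for c in row] for row in counts]

counts = [[0, 3], [1, 0]]       # toy word-by-document counts
weighted = log_weight(counts)
print(weighted)
```

Since log(0 + 1) = 0, zero entries stay zero, so the weighting preserves the sparsity pattern of the matrix while damping large counts.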

*http://web.eecs.utk.edu/research/lsi/

The second file is used to construct the query matrix $\mathbf{Q} \in \mathbb{R}_{+}^{M \times Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_Q]$,
where $Q$ denotes the number of queries (shown in the last row of table 5.9), $M$
denotes the number of unique words, and $\mathbf{q}_q$ denotes the $q$-th query vector constructed
from the file. So simply by multiplying $\mathbf{Q}^T$ with the corresponding $\mathbf{A}$, one obtains a
$Q \times N$ matrix whose $q$-th row contains the scores that describe how relevant each
document is to the $q$-th query.
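As a small illustration of this scoring step, the sketch below forms the score matrix for two queries at once. The function name `score_matrix` is hypothetical, and the toy data reuse the polysemy example of table 5.5 rather than any of the real collections.

```python
def score_matrix(Q, A):
    """Return Q^T A as nested lists.

    Q is M x (number of queries), A is M x N; entry [q][n] of the result
    is the score of document n for query q.
    """
    n_words, n_queries, n_docs = len(A), len(Q[0]), len(A[0])
    return [[sum(Q[m][q] * A[m][n] for m in range(n_words))
             for n in range(n_docs)]
            for q in range(n_queries)]

# Rows: money, bed, river, bank, interest; columns: Doc1..Doc6.
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0],
]
# Columns: query {money, bank} and query {river, bank}.
Q = [[1, 0], [0, 0], [0, 1], [1, 1], [0, 0]]

scores = score_matrix(Q, A)
print(scores)  # [[2, 1, 2, 1, 1, 1], [1, 2, 1, 2, 1, 1]]
```

Each row can then be sorted in descending order to rank the documents for the corresponding query.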

And the third file is the file that maps each query to its manually assigned relevant 
documents. This information will be utilized as the references to measure the retrieval 
performances of the SVD and the NMF. 

To measure the LSI performance, the average precision will be used. It is the standard
metric in IR research [44] and measures the $l$-point interpolated average pseudo-precision
over the recall levels in [0, 1]. This metric captures both the recall and precision concepts
without inheriting the weakness of recall, i.e., that perfect recall can be achieved by
retrieving all documents. The following outlines the definition of average precision;
more detailed discussions can be found in, e.g., refs. [42], [44], [45].

First the definition of precision will be discussed. Let $\mathbf{r} = \mathbf{q}^T \mathbf{A}$ denote a vector
that contains document scores with respect to the query vector $\mathbf{q}$, and let $\mathbf{r}$ be sorted
in descending order (larger scores come first). The precision at the $n$-th document is given by

$$p_n = \frac{r_n}{n},$$

where $r_n$ denotes the number of relevant documents up to the $n$-th position. The
pseudo-precision at recall level $x \in [0, 1]$ is defined as

$$\tilde{p}(x) = \max\{p_n \mid x \le r_n / r_N,\ n = 1, \ldots, N\},$$

where $r_N$ denotes the total number of relevant documents in the collection. The
$l$-point interpolated average precision for a single query $\mathbf{q}$ is defined as

$$\bar{p}_l(\mathbf{q}) = \frac{1}{l} \sum_{n=0}^{l-1} \tilde{p}\left(\frac{n}{l-1}\right),$$

where the index $n$ here runs over the $l$ equally spaced recall levels $n/(l-1)$. We will use
the 11-point interpolated average precision ($l = 11$) as proposed in ref. [42], since three
out of four text collections used in our experiments are similar to those used in ref. [42].
However, due to the differences in the preprocessing steps, our results will not be similar
to the results of ref. [12]. Because there are several queries in each text collection (shown
in the last row of table 5.9), the average precision used in this work is the average value
over #Query. So, for each text collection:

$$\text{average precision} = \frac{1}{Q} \sum_{q=1}^{Q} \bar{p}_l(\mathbf{q}_q),$$
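The metric can be sketched in plain Python as below. The function names are hypothetical, ties in the scores are broken by document order (an assumption), and each query is assumed to have at least one scored document. Recall levels are compared using integers to avoid floating-point issues at levels such as 0.5.

```python
def interpolated_average_precision(scores, relevant, points=11):
    """l-point interpolated average precision for a single query.

    scores   -- document scores (entries of r = q^T A)
    relevant -- set of indices of the relevant documents
    points   -- number of recall levels l
    """
    order = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    total_relevant = len(relevant)                 # r_N
    hits, count = [], 0                            # hits[n-1] = r_n
    for doc in order:
        count += doc in relevant
        hits.append(count)
    precisions = [h / (n + 1) for n, h in enumerate(hits)]   # p_n = r_n / n
    total = 0.0
    for level in range(points):                    # x = level / (points - 1)
        # Keep positions with x <= r_n / r_N, tested in integer arithmetic.
        eligible = [p for p, h in zip(precisions, hits)
                    if level * total_relevant <= h * (points - 1)]
        total += max(eligible)                     # pseudo-precision at x
    return total / points

def mean_average_precision(score_rows, relevant_sets, points=11):
    """Average of the per-query values over all queries."""
    values = [interpolated_average_precision(s, r, points)
              for s, r in zip(score_rows, relevant_sets)]
    return sum(values) / len(values)

single = interpolated_average_precision([0.9, 0.8, 0.7, 0.1], {0, 2})
print(round(single, 4))
```

In the toy call the relevant documents sit at positions 1 and 3, so the pseudo-precision is 1 for recall levels up to 0.5 and 2/3 above that.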
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Fig. 5.1. Average precision values over decomposition ranks. 
Table 5.10
Average values of the average precision over 10 trials.

          Medline       Cranfield     CISI          ADI
SVD       0.4967(600)   0.3365(600)   0.1617(170)   0.2663(33)
NMFLS     0.4769(600)   0.2674(600)   0.1510(530)   0.2674(39)
NMFJK     0.4862(600)   0.2871(600)   0.1434(360)   0.2610(40)

where $Q$ denotes #Query.

Fig. 5.1 shows the average precision values over decomposition ranks (for Medline,
Cranfield, and CISI: [10, 20, ..., 600], and for ADI: [1, 2, ..., 40]) for all datasets. Table
5.10 displays average values of the average precision over 10 trials in the
format val(rank), where val denotes the average value over 10 trials at this rank,
and rank denotes the rank at which the maximum average precision value is obtained in
the first trial. For example, in ADI the SVD reaches its maximum average precision in the
first trial at rank 33, NMFLS at rank 39, and NMFJK at rank 40; these are also the ranks
at which the peak values are achieved in fig. 5.1 for each dataset and each
method. Note that because the approximate matrices produced by the SVD are unique,
there is no need to repeat the computation for the SVD.

As shown in fig. 5.1, in general the SVD produces better and more stable average
precision values over decomposition ranks for all datasets, with a tendency for higher
decomposition ranks to give higher average precision values. The average precision
values produced by the NMF algorithms seem to be unstable, especially for CISI.
NMFJK seems to have slightly better average precision than NMFLS. This result
is interesting since, as discussed in section 4.2, the clustering capability of NMFJK is
also better than that of NMFLS.

While there are cases in which NMFLS and NMFJK outperform the SVD, when
the computations are repeated over 10 trials and the results are averaged, as shown
in table 5.10, the superiority of NMFLS and NMFJK seems to vanish. This is
understandable: when NMF algorithms converge, only the stationarity of the limit
points is guaranteed (so not even local optimality is guaranteed by NMF algorithms).
SVD algorithms, on the other hand, not only have a global-optimality guarantee, but
also produce the same factors, with differences only in the numerical precision [42] (at
least theoretically).

Because the rank-$k$ truncated SVD can be constructed from the full-rank SVD by taking
the first $k$ columns of the singular matrices and the $k \times k$ leading principal submatrix
of the singular value matrix, the computational times for the SVD are not recorded for each
decomposition rank. Instead, we compute the full-rank SVD for each dataset and record
the times, which are 498.59, 448.67, 237.46, and 1.1924 seconds for Medline, Cranfield,
CISI, and ADI respectively. The computational times over decomposition ranks
for NMFLS and NMFJK are shown in fig. 5.2.
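The reuse of a single full SVD for all decomposition ranks can be sketched as follows. The generator name `truncations_from_full_svd` is hypothetical, `numpy.linalg.svd` is assumed as the SVD routine, and the small matrix of table 5.5 stands in for a real collection.

```python
import numpy as np

def truncations_from_full_svd(matrix, ranks):
    """Compute the SVD once and yield (k, rank-k approximation) for each k."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    for k in ranks:
        # First k left vectors scaled by the k leading singular values,
        # times the first k right vectors.
        yield k, (u[:, :k] * s[:k]) @ vt[:k, :]

# Toy matrix (table 5.5): rows are words, columns are documents.
A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0],
], dtype=float)

approximations = dict(truncations_from_full_svd(A, ranks=[1, 2, 3]))
singular_values = np.linalg.svd(A, compute_uv=False)
for k, A_k in approximations.items():
    print(k, round(float(np.linalg.norm(A - A_k)), 4))
```

The printed Frobenius errors shrink as k grows; each equals the square root of the sum of the squared discarded singular values, which is a convenient check that the slicing is correct.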

As shown in fig. 5.2, in general, for every dataset NMFJK is faster than NMFLS
at lower ranks, but the computational times of NMFJK then grow faster than those of
NMFLS, resulting in slower performance at higher ranks. These results are interesting
since, according to the creator of NMFJK, this algorithm is the fastest NMF
algorithm so far [24]. Table 4.7 can also be considered for evaluating the computational
times of NMFJK; in the Reuters datasets, NMFJK is faster than NMFLS.
Since the decomposition ranks there are very small (up to 12), the results
in table 4.7 are in accord with the results in fig. 5.2. Thus, it seems that at lower ranks
NMFJK is faster than NMFLS, but at higher ranks NMFJK is slower than NMFLS.
Table 5.11 shows the average computational times of NMFJK and NMFLS over 10
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Fig. 5.2. Computational times over decomposition ranks (second). 

Table 5.11 
Average computational times over 10 trials. 
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0.4165 
0.5045 



trials for the decomposition ranks shown in table 5.10. As the ranks are all high, NMFJK
is slower than NMFLS for all datasets.



6. Conclusions. We have presented a theoretical framework for supporting the
clustering aspect of the NMF without setting the KKT multipliers to zero. Thus the
stationary point used in proving this aspect is guaranteed to be on the nonnegative
orthant, which is the feasible region of the NMF. Our theoretical work implies a
limitation of the NMF as a clustering method: it cannot be used for clustering
linearly inseparable datasets. So the NMF as a clustering method resembles
k-means clustering or SVM more than spectral clustering, even though both the NMF
and the spectral methods utilize matrix decomposition techniques. As the clustering
capabilities of k-means and SVM can usually be improved by using kernel methods,
the same approach can probably also be employed in the NMF. We will address
this issue in our future research.

The clustering capability of NMFJK is comparable to that of the SVD on the Reuters
datasets, with NMFJK tending to be better for small #cluster and the SVD for big #cluster.
Unfortunately NMFLS, which is the standard NMF algorithm, cannot outperform
the SVD. These results imply that the clustering aspect of the NMF is algorithm-dependent,
a fact that seems to be overlooked in NMF research.

The LSI aspect of the NMF seems to be comparable to the SVD in its power for
solving synonymy and polysemy problems for datasets with clear semantic structures
that allow these problems to be revealed. In real datasets, however, the NMF
generally cannot outperform the SVD. But an interesting fact comes into sight: in
some cases the NMF can outperform the SVD, even though these advantages vanish
when the computations are repeated and averaged over the number of trials. Because
the NMF can offer different results depending on the algorithms, the initializations,
the objectives, and the problems, improving the LSI capability of the NMF is possible.
We will address this problem in our future research.
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